Web Scraping with R & rvest

Dr. Matthew Hendrickson

July 9, 2020

Topics

  1. About Me
  2. A Little About Web Scraping
  3. Robots.txt
  4. HTML & CSS
  5. Web Scraping
  6. The Setup
  7. Scraping the Data
  8. Assembling the Data
  9. References & Resources

About Me

  • Social Scientist by Training
    • Psychology & Music %>%
    • More Psychology %>%
    • Law & Policy
  • Professional Experience (13+ years)
    • Higher Education Analyst
    • Independent Consultant
    • Research projects, data analysis, policy development, strategy, analytics pipeline solutions

A Little About Web Scraping

“Web scraping is the process of automatically mining data or collecting information from the World Wide Web.” – Wikipedia

Web scraping is a flexible method to extract data from the internet. It can involve extracting numerical or text data.

Use Cases

There are many uses for web scraping, including but not limited to:

  1. Price monitoring
  2. Sentiment analysis
  3. Time series tracking and analysis
  4. Brand monitoring
  5. Market analysis
  6. Lead generation

Robots.txt

Always ensure - PRIOR to scraping - that you have rights to scrape the website.

This is critical as you can be blocked from sites or even face legal action.

Robots.txt

Good news! You can easily check with the robotstxt package.

paths_allowed(paths = c("https://netflix.com/"))
#> [1] FALSE

This example shows that Netflix does not allow you to scrape their site.
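
Rules in robots.txt apply per path and per user agent, so a check on the site root does not necessarily cover the page you plan to request. Below is a minimal offline sketch, assuming the robotstxt() constructor accepts the file contents through its text argument. The rules are made up for illustration and do not belong to any real site.

```r
library(robotstxt)

# Hypothetical robots.txt contents (not taken from any real site)
rules <- paste(
  "User-agent: *",
  "Disallow: /private/",
  sep = "\n"
)

# Build the checker from text instead of downloading the file
rt <- robotstxt(text = rules)

# Check individual paths for the generic "*" user agent
root_ok    <- rt$check(paths = "/", bot = "*")
private_ok <- rt$check(paths = "/private/report.html", bot = "*")
```

The same idea applies to paths_allowed(): passing the specific path you intend to scrape is more informative than passing only the home page.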

HTML & CSS

“HTML is the standard markup language for creating Web pages.” – W3Schools

“CSS describes how HTML elements are to be displayed on screen, paper, or in other media.” – W3Schools

HTML Structure

Image credit: Professor Shawn Santo

HTML Tags

HTML is structured with “tags.” These tags mark up portions of the page, and you can select content by tag when scraping.

There are many types of tags - here are some important ones for scraping:

  • <h1> - heading tags (<h1> through <h6>)
  • <p> - paragraph elements
  • <ul> - unordered bulleted list
  • <ol> - ordered list
  • <li> - individual list item
  • <div> - division
  • <table> - table
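
To make the tags concrete, here is a small sketch that parses a made-up page from a string and pulls text out by tag name. The page content is invented for illustration.

```r
library(rvest)

# A tiny, made-up page containing several of the tags listed above
page <- read_html('
  <html><body>
    <h1>R Books</h1>
    <p>Two titles worth reading.</p>
    <ul>
      <li>Advanced R</li>
      <li>R for Data Science</li>
    </ul>
  </body></html>
')

# Select nodes by tag name, then extract their text
heading <- page %>% html_nodes("h1") %>% html_text()
items   <- page %>% html_nodes("li") %>% html_text()
```

Here heading holds the single heading text and items holds one string per list item.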

A Little Help with CSS

If you aren’t familiar with CSS, extracting parts of a website can be daunting.

SelectorGadget is incredibly helpful for this purpose. It is available as a Chrome extension and as a bookmarklet for other browsers.

Another option is to inspect the page elements with the developer tools built into most major browsers, including Chrome and Firefox.

Web Scraping

Scraping Methods

CSS - syntax is easier and aligns with HTML tags

XPATH - useful when the node isn’t uniquely identified with CSS
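
A short sketch of the two approaches on a made-up page. The markup and class names are invented for illustration. Both selectors return the same nodes, and the last query shows a case where XPath filters on text content, which standard CSS selectors do not express.

```r
library(rvest)

page <- read_html('
  <html><body>
    <div class="book"><span class="title">Advanced R</span><span>Paperback</span></div>
    <div class="book"><span class="title">The Book of R</span><span>Kindle</span></div>
  </body></html>
')

# CSS selector: concise when a class or id identifies the node
titles_css <- page %>%
  html_nodes(css = "span.title") %>%
  html_text()

# Equivalent XPath expression for the same nodes
titles_xpath <- page %>%
  html_nodes(xpath = "//span[@class='title']") %>%
  html_text()

# XPath filtering on text: title of the book whose sibling span reads "Kindle"
kindle_title <- page %>%
  html_nodes(xpath = "//div[span[text()='Kindle']]/span[@class='title']") %>%
  html_text()
```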

The Setup

Set up the environment to scrape the site.

library(tidyverse)
library(robotstxt)
library(rvest)

That’s it! These are all the tools you’ll need.

Determine a website to scrape

It only seems appropriate to pull data from Amazon regarding R books.

Ensure we can scrape the site

paths_allowed(paths = c("https://amazon.com/"))
#> [1] TRUE

We are good to scrape!

Setting the URL

Before you can get started, you must specify the URL to pass to the function.

Data as of 2020-06-29.

  amazon <- read_html("https://www.amazon.com/s?k=R&i=stripbooks&rh=n%3A283155%2Cn%3A75%2Cn%3A13983&dc&qid=1592086532&rnid=1000&ref=sr_nr_n_1")

Titles

Scraping Book Titles

amazon %>% 
  html_nodes(".s-line-clamp-2") %>% 
  html_text() -> amazon_titles
head(amazon_titles)
#> [1] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Mastering R for Quantitative Finance\n            \n        \n        \n    \n\n\n    \n"                                                                                                                 
#> [2] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R for Data Science: Import, Tidy, Transform, Visualize, and Model Data\n            \n        \n        \n    \n\n\n    \n"                                                                               
#> [3] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                The Book of R: A First Course in Programming and Statistics\n            \n        \n        \n    \n\n\n    \n"                                                                                          
#> [4] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                R Graphics Cookbook: Practical Recipes for Visualizing Data\n            \n        \n        \n    \n\n\n    \n"                                                                                          
#> [5] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                Discovering Statistics Using R\n            \n        \n        \n    \n\n\n    \n"                                                                                                                       
#> [6] "\n    \n    \n        \n\n\n\n\n\n    \n        \n            \n                The Programmers Code: A Deep Dive Into Mastering Computer Programming Including Python, C, C++, C#, Html Coding, Raspberry Pi3, And Black Hat Hacking\n            \n        \n        \n    \n\n\n    \n"

The element pulls in a great deal of white space and line breaks (\n), which need to be removed.

Let’s clean this up with str_trim.

amazon_titles <- str_trim(amazon_titles) # Removes leading & trailing whitespace
head(amazon_titles)
#> [1] "Mastering R for Quantitative Finance"                                                                                                                 
#> [2] "R for Data Science: Import, Tidy, Transform, Visualize, and Model Data"                                                                               
#> [3] "The Book of R: A First Course in Programming and Statistics"                                                                                          
#> [4] "R Graphics Cookbook: Practical Recipes for Visualizing Data"                                                                                          
#> [5] "Discovering Statistics Using R"                                                                                                                       
#> [6] "The Programmers Code: A Deep Dive Into Mastering Computer Programming Including Python, C, C++, C#, Html Coding, Raspberry Pi3, And Black Hat Hacking"

This simple function returns cleaned text.
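
Note that str_trim only touches the ends of the string. If a scraped value also had stray whitespace in the middle, str_squish from the same package would collapse it. Here is a small sketch with a made-up value:

```r
library(stringr)

# Made-up raw value resembling the scraped titles, with extra internal spaces
raw <- "\n    \n        The Book   of R\n    \n"

trimmed  <- str_trim(raw)    # strips leading and trailing whitespace only
squished <- str_squish(raw)  # also collapses internal runs of whitespace
```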

Formats

Scraping the Book Format

amazon %>% 
  html_nodes("a.a-size-base.a-link-normal.a-text-bold") %>% 
  html_text() -> amazon_format
head(amazon_format)
#> [1] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [2] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [3] "\n    \n        \n        \n            Kindle\n        \n    \n"   
#> [4] "\n    \n        \n        \n            Paperback\n        \n    \n"
#> [5] "\n    \n        \n        \n            eTextbook\n        \n    \n"
#> [6] "\n    \n        \n        \n            Paperback\n        \n    \n"

Clean up book format values

amazon_format <- str_trim(amazon_format)
head(amazon_format)
#> [1] "Paperback" "Paperback" "Kindle"    "Paperback" "eTextbook" "Paperback"

Price

Scraping the Book Price

The price structure splits price into two elements. We must pull each and combine them into a single price.

amazon %>% 
  html_nodes(".a-price-whole") %>% 
  html_text() -> amazon_price_whole
head(amazon_price_whole)
#> [1] "45." "39." "24." "33." "29." "23."

Scraping (the rest of) the Book Price

amazon %>% 
  html_nodes(".a-price-fraction") %>% 
  html_text() -> amazon_price_fraction
head(amazon_price_fraction)
#> [1] "36" "49" "99" "04" "99" "92"

Combine Price Portions

amazon_price <- as.numeric(paste(amazon_price_whole, amazon_price_fraction, sep = ""))
head(amazon_price)
#> [1] 45.36 39.49 24.99 33.04 29.99 23.92
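
The paste-and-convert step relies on the whole part containing no thousands separator. Here is a sketch with made-up values, showing one way to guard against a comma that would otherwise turn the price into NA:

```r
# Made-up price parts; the second whole part includes a thousands separator
price_whole    <- c("45.", "1,045.")
price_fraction <- c("36", "99")

# Join the two parts into one string per price
combined <- paste0(price_whole, price_fraction)

# Remove commas before converting; as.numeric("1,045.99") would give NA
price <- as.numeric(gsub(",", "", combined, fixed = TRUE))
```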

Rating

Scraping the Book Rating

amazon %>% 
  html_nodes("i.a-icon.a-icon-star-small.aok-align-bottom") %>% 
  html_text() -> amazon_rating
head(amazon_rating)
#> [1] "3.3 out of 5 stars" "4.7 out of 5 stars" "4.3 out of 5 stars"
#> [4] "4.7 out of 5 stars" "4.5 out of 5 stars" "5.0 out of 5 stars"

Let’s trim this into a usable metric

amazon_rating <- as.numeric(substr(amazon_rating, 1, 3)) # Takes 3 characters starting at 1
head(amazon_rating)
#> [1] 3.3 4.7 4.3 4.7 4.5 5.0

Rating Counts

Scraping the Book Rating Count

This element is messier and we’ll need a number of cleaning steps.

amazon %>% 
  html_nodes("div.a-row.a-size-small") %>% 
  html_text() -> amazon_rate_n
head(amazon_rate_n)
#> [1] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            3.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                13\n            \n        \n        \n    \n\n\n\n" 
#> [2] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                426\n            \n        \n        \n    \n\n\n\n"
#> [3] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                75\n            \n        \n        \n    \n\n\n\n" 
#> [4] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14\n            \n        \n        \n    \n\n\n\n" 
#> [5] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            4.5 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                255\n            \n        \n        \n    \n\n\n\n"
#> [6] "\n\n\n\n    \n\n\n\n\n\n\n    \n        \n            \n            5.0 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                1\n            \n        \n        \n    \n\n\n\n"

Clean up Rating Count - Trim

amazon_rate_n <- str_trim(amazon_rate_n)    # trim \n & ' '
head(amazon_rate_n)
#> [1] "3.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                13" 
#> [2] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                426"
#> [3] "4.3 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                75" 
#> [4] "4.7 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                14" 
#> [5] "4.5 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                255"
#> [6] "5.0 out of 5 stars\n        \n    \n    \n\n\n\n\n\n\n\n    \n\n\n\n\n\n    \n        \n            \n                1"

Clean up Rating Count - Substring

amazon_rate_n <- str_sub(amazon_rate_n, -5) # keep last 5 characters
head(amazon_rate_n)
#> [1] "   13" "  426" "   75" "   14" "  255" "    1"

Clean up Rating Count - Trim (Again)

amazon_rate_n <- str_trim(amazon_rate_n)    # trim leading spaces
head(amazon_rate_n)
#> [1] "13"  "426" "75"  "14"  "255" "1"

Clean up Rating Count - Set as Numeric

amazon_rate_n <- as.numeric(amazon_rate_n)
head(amazon_rate_n)
#> [1]  13 426  75  14 255   1
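
Keeping the last five characters works as long as every count fits in five characters and has no thousands separator. Here is a sketch of a pattern-based alternative, using made-up strings shaped like the scraped values:

```r
library(stringr)

# Made-up values shaped like the scraped text (rating, whitespace, count)
raw <- c(
  "\n  3.3 out of 5 stars\n      \n        13\n",
  "\n  4.7 out of 5 stars\n      \n        1,204\n"
)

# After trimming, take the run of digits and commas at the end of the string
count_txt <- str_extract(str_trim(raw), "[0-9,]+$")

# Drop any commas, then convert to numeric
rate_n <- as.numeric(str_remove_all(count_txt, ","))
```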

Publication Date

Scraping the Book Publication Date

amazon %>% 
  html_nodes("span.a-size-base.a-color-secondary.a-text-normal") %>% 
  html_text() -> amazon_pub_dt
head(amazon_pub_dt)
#> [1] "Mar 10, 2015" "Jan 10, 2017" "Jul 16, 2016" "Nov 30, 2018" "Apr 5, 2012" 
#> [6] "May 7, 2020"

We need to convert this to a date to allow easier analysis.

amazon_pub_dt <- as.Date(amazon_pub_dt, "%b %d, %Y")
head(amazon_pub_dt)
#> [1] "2015-03-10" "2017-01-10" "2016-07-16" "2018-11-30" "2012-04-05"
#> [6] "2020-05-07"
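
The %b format code matches month abbreviations in the session's locale, so the conversion can return NA on a machine set to a non-English locale. Here is a sketch of a locale-independent alternative built on base R's month.abb constant. The helper name parse_pub_date is hypothetical, and the pattern assumes dates shaped like the ones above.

```r
# Hypothetical helper: parse "Mar 10, 2015" style dates without relying on locale
parse_pub_date <- function(x) {
  pattern <- "^([A-Za-z]{3}) ([0-9]{1,2}), ([0-9]{4})$"
  parts <- regmatches(x, regexec(pattern, x))
  iso <- vapply(parts, function(p) {
    if (length(p) != 4) return(NA_character_)  # string did not match the pattern
    # p[2] = month abbreviation, p[3] = day, p[4] = year
    sprintf("%s-%02d-%02d", p[4], match(p[2], month.abb), as.integer(p[3]))
  }, character(1))
  as.Date(iso, format = "%Y-%m-%d")
}

pub_dt <- parse_pub_date(c("Mar 10, 2015", "Apr 5, 2012"))
```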

We Have the Pieces

Let’s assemble the file!

  1. Titles
  2. Formats
  3. Prices
  4. Ratings
  5. Rating Counts
  6. Publication Date

Let’s Check the Scrapes

length(amazon_titles)
#> [1] 19
length(amazon_format)
#> [1] 39
length(amazon_price)
#> [1] 39
length(amazon_rating)
#> [1] 17
length(amazon_rate_n)
#> [1] 17
length(amazon_pub_dt)
#> [1] 19
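
The same check can be done in one step by collecting the pieces in a named list. In this sketch the vectors are empty placeholders with the lengths reported above, not the scraped data.

```r
# Placeholder vectors with the lengths reported above (contents are dummies)
pieces <- list(
  titles = character(19),
  format = character(39),
  price  = numeric(39),
  rating = numeric(17),
  rate_n = numeric(17),
  pub_dt = character(19)
)

n <- lengths(pieces)               # named integer vector of lengths
aligned <- length(unique(n)) == 1  # TRUE only when every piece has the same length
```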

Wait! What?!?

One issue with scraping is that elements pulled separately can come back with different numbers of records, for example when some records are missing data elements.

We can fix this!

  • …manually…

Fixing the Scrapes

Titles

All titles were populated and scraped accurately. However, due to multiple formats, these records must be repeated to fill the dataframe.

amazon_titles %>% 
  append(values = amazon_titles[17], after = 17) %>% # R Companion
  append(values = amazon_titles[16], after = 16) %>% # R for Dummies
  append(values = amazon_titles[15], after = 15) %>% # Linear Models
  append(values = amazon_titles[14], after = 14) %>% # R Cookbook
  append(values = amazon_titles[13], after = 13) %>% # Intro to Stat Learn
  append(values = amazon_titles[12], after = 12) %>% # Learning R
  append(values = amazon_titles[11], after = 11) %>% # Baseball with R
  append(values = amazon_titles[11], after = 11) %>% # Baseball with R
  append(values = amazon_titles[10], after = 10) %>% # Stats with R
  append(values = amazon_titles[9], after = 9) %>%   # Hands-On R
  append(values = amazon_titles[8], after = 8) %>%   # Advanced R
  append(values = amazon_titles[7], after = 7) %>%   # Advanced R
  append(values = amazon_titles[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_titles[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_titles[5], after = 5) %>%   # GLM
  append(values = amazon_titles[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_titles[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_titles[3], after = 3) %>%   # R Graphics
  append(values = amazon_titles[2], after = 2) %>%   # Book of R
  append(values = amazon_titles[1], after = 1) -> amazon_titles # R4DS
length(amazon_titles)
#> [1] 39
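
The append() chain can also be written as a single rep() call with a count per title. In this sketch the titles are placeholders and the counts are read off the chain above: titles 4, 6 and 11 appear three times, 18 and 19 once, and the rest twice.

```r
# Placeholder titles standing in for the 19 scraped values
titles <- sprintf("book_%02d", 1:19)

# Rows needed per title, read off the append() chain above
n_formats <- c(2, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 1, 1)

# Repeat each title as many times as it has formats
expanded <- rep(titles, times = n_formats)
```

The same n_formats vector could be reused for any other 19-element vector that needs to be expanded in the same way.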

Formats

Nothing needed here!

length(amazon_format)
#> [1] 39

Prices

Or here!

length(amazon_price)
#> [1] 39

Ratings

Some books do not have ratings. A book only has one rating even if it has multiple formats.

For example, the 6th and 9th books do not have ratings.

We must also account for multiple formats.

amazon_rating %>% 
  append(values = NA, after = 5) %>% 
  append(values = NA, after = 8) -> amazon_rating
length(amazon_rating)
#> [1] 19

Ratings

Like titles, the ratings need to be repeated to show on the correct row.

The same corrections are done here.

amazon_rating %>% 
  append(values = amazon_rating[17], after = 17) %>% # R Companion
  append(values = amazon_rating[16], after = 16) %>% # R for Dummies
  append(values = amazon_rating[15], after = 15) %>% # Linear Models
  append(values = amazon_rating[14], after = 14) %>% # R Cookbook
  append(values = amazon_rating[13], after = 13) %>% # Intro to Stat Learn
  append(values = amazon_rating[12], after = 12) %>% # Learning R
  append(values = amazon_rating[11], after = 11) %>% # Baseball with R
  append(values = amazon_rating[11], after = 11) %>% # Baseball with R
  append(values = amazon_rating[10], after = 10) %>% # Stats with R
  append(values = amazon_rating[9], after = 9) %>%   # Hands-On R
  append(values = amazon_rating[8], after = 8) %>%   # Advanced R
  append(values = amazon_rating[7], after = 7) %>%   # Advanced R
  append(values = amazon_rating[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_rating[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_rating[5], after = 5) %>%   # GLM
  append(values = amazon_rating[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_rating[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_rating[3], after = 3) %>%   # R Graphics
  append(values = amazon_rating[2], after = 2) %>%   # Book of R
  append(values = amazon_rating[1], after = 1) -> amazon_rating # R4DS
length(amazon_rating)
#> [1] 39

Rating Counts

Not all titles have a rating count, specifically the 5th and 7th.

amazon_rate_n %>% 
  append(values = NA, after = 4) %>% 
  append(values = NA, after = 6) -> amazon_rate_n
length(amazon_rate_n)
#> [1] 19

Rating Counts

We must also account for multiple formats.

amazon_rate_n %>% 
  append(values = amazon_rate_n[17], after = 17) %>% # R Companion
  append(values = amazon_rate_n[16], after = 16) %>% # R for Dummies
  append(values = amazon_rate_n[15], after = 15) %>% # Linear Models
  append(values = amazon_rate_n[14], after = 14) %>% # R Cookbook
  append(values = amazon_rate_n[13], after = 13) %>% # Intro to Stat Learn
  append(values = amazon_rate_n[12], after = 12) %>% # Learning R
  append(values = amazon_rate_n[11], after = 11) %>% # Baseball with R
  append(values = amazon_rate_n[11], after = 11) %>% # Baseball with R
  append(values = amazon_rate_n[10], after = 10) %>% # Stats with R
  append(values = amazon_rate_n[9], after = 9) %>%   # Hands-On R
  append(values = amazon_rate_n[8], after = 8) %>%   # Advanced R
  append(values = amazon_rate_n[7], after = 7) %>%   # Advanced R
  append(values = amazon_rate_n[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_rate_n[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_rate_n[5], after = 5) %>%   # GLM
  append(values = amazon_rate_n[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_rate_n[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_rate_n[3], after = 3) %>%   # R Graphics
  append(values = amazon_rate_n[2], after = 2) %>%   # Book of R
  append(values = amazon_rate_n[1], after = 1) -> amazon_rate_n # R4DS
length(amazon_rate_n)
#> [1] 39

Publication Date

Create extra rows due to multiple book formats.

amazon_pub_dt %>% 
  append(values = amazon_pub_dt[17], after = 17) %>% # R Companion
  append(values = amazon_pub_dt[16], after = 16) %>% # R for Dummies
  append(values = amazon_pub_dt[15], after = 15) %>% # Linear Models
  append(values = amazon_pub_dt[14], after = 14) %>% # R Cookbook
  append(values = amazon_pub_dt[13], after = 13) %>% # Intro to Stat Learn
  append(values = amazon_pub_dt[12], after = 12) %>% # Learning R
  append(values = amazon_pub_dt[11], after = 11) %>% # Baseball with R
  append(values = amazon_pub_dt[11], after = 11) %>% # Baseball with R
  append(values = amazon_pub_dt[10], after = 10) %>% # Stats with R
  append(values = amazon_pub_dt[9], after = 9) %>%   # Hands-On R
  append(values = amazon_pub_dt[8], after = 8) %>%   # Advanced R
  append(values = amazon_pub_dt[7], after = 7) %>%   # Advanced R
  append(values = amazon_pub_dt[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_pub_dt[6], after = 6) %>%   # Interactive Shiny
  append(values = amazon_pub_dt[5], after = 5) %>%   # GLM
  append(values = amazon_pub_dt[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_pub_dt[4], after = 4) %>%   # Discovering Stats
  append(values = amazon_pub_dt[3], after = 3) %>%   # R Graphics
  append(values = amazon_pub_dt[2], after = 2) %>%   # Book of R
  append(values = amazon_pub_dt[1], after = 1) -> amazon_pub_dt # R4DS
length(amazon_pub_dt)
#> [1] 39

One More Check!

length(amazon_titles)
#> [1] 39
length(amazon_format)
#> [1] 39
length(amazon_price)
#> [1] 39
length(amazon_rating)
#> [1] 39
length(amazon_rate_n)
#> [1] 39
length(amazon_pub_dt)
#> [1] 39

(Finally) Assemble the Data

r_books <- tibble(title = amazon_titles,
                  text_format = amazon_format,
                  price = amazon_price,
                  rating = amazon_rating,
                  num_ratings = amazon_rate_n,
                  publication_date = amazon_pub_dt)
head(r_books)
#> # A tibble: 6 x 6
#>   title                    text_format price rating num_ratings publication_date
#>   <chr>                    <chr>       <dbl>  <dbl>       <dbl> <date>          
#> 1 Mastering R for Quantit~ Paperback    45.4    3.3          13 2015-03-10      
#> 2 Mastering R for Quantit~ Paperback    39.5    3.3          13 2015-03-10      
#> 3 R for Data Science: Imp~ Kindle       25.0    4.7         426 2017-01-10      
#> 4 R for Data Science: Imp~ Paperback    33.0    4.7         426 2017-01-10      
#> 5 The Book of R: A First ~ eTextbook    30.0    4.3          75 2016-07-16      
#> 6 The Book of R: A First ~ Paperback    23.9    4.3          75 2016-07-16
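
The manual alignment in the earlier steps depends on knowing which records are missing. One alternative is to select each result container first and then pull one value per container with html_node() (singular), which returns NA where nothing matches. The sketch below uses made-up markup and hypothetical class names. Amazon's real page structure differs and would need its own selectors.

```r
library(rvest)

# Made-up markup: the second result has no rating
page <- read_html('
  <html><body>
    <div class="result"><span class="title">Book A</span><span class="rating">4.7</span></div>
    <div class="result"><span class="title">Book B</span></div>
    <div class="result"><span class="title">Book C</span><span class="rating">3.9</span></div>
  </body></html>
')

# One node per search result
cards <- page %>% html_nodes("div.result")

# html_node() returns one entry per card, so columns stay aligned
books <- data.frame(
  title  = cards %>% html_node(".title") %>% html_text(),
  rating = cards %>% html_node(".rating") %>% html_text() %>% as.numeric(),
  stringsAsFactors = FALSE
)
```

Because every column is derived from the same cards, the vectors have equal length without manual padding.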

Thank you


@mjhendrickson


matthewjhendrickson


mjhendrickson


Web Scraping in R & rvest repo

This talk is freely distributed under the MIT License.

References & Resources
